In [1]:
import numpy as np
import pandas as pd
import pickle as pkl
import dalex as dx
from lime import lime_tabular
import random
import warnings

warnings.filterwarnings("ignore")

Model and data loading¶

In [2]:
data = pd.read_pickle("resources/data/housing_preproc.pkl")
In [3]:
with open("resources/models/neural_network.pkl", "rb") as file:
    mlp = pkl.load(file)
In [4]:
X = data.drop(columns=["median_house_value"])
y = data["median_house_value"]

Selecting observations and calculating predictions¶

In [5]:
index = [221, 420, 2137]
In [6]:
observation_1 = X.iloc[[index[0]]]
observation_2 = X.iloc[[index[1]]]
observation_3 = X.iloc[[index[2]]]
In [7]:
prediction_1 = mlp.predict(observation_1)
prediction_2 = mlp.predict(observation_2)
prediction_3 = mlp.predict(observation_3)

predictions = [prediction_1, prediction_2, prediction_3]
for i, (idx, pred) in enumerate(zip(index, predictions), start=1):
    print(
        f"Real value of observation {i}: {y.iloc[idx]:.4f}; "
        f"predicted value: {pred[0]:.4f}"
    )
Real value of observation 1: -0.4840; predicted value: -0.5153
Real value of observation 2: 1.4216; predicted value: 1.5363
Real value of observation 3: -1.0343; predicted value: -1.2257
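The agreement above can also be summarized numerically. A minimal sketch computing the absolute errors from the real/predicted pairs printed above (values copied from that output):

```python
real = [-0.4840, 1.4216, -1.0343]
predicted = [-0.5153, 1.5363, -1.2257]

# absolute prediction error for each selected observation
errors = [abs(r - p) for r, p in zip(real, predicted)]
print([round(e, 4) for e in errors])  # → [0.0313, 0.1147, 0.1914]
```

Observation 1 is predicted most accurately; observation 3 has the largest error of the three.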

Break-down and Shapley plots¶

In [8]:
explainer1 = dx.Explainer(mlp, X, y, label="California Housing")
Preparation of a new explainer is initiated

  -> data              : 20640 rows 13 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 20640 values
  -> model_class       : sklearn.model_selection._search.GridSearchCV (default)
  -> label             : California Housing
  -> predict function  : <function yhat_default at 0x000001C6F19CDFC0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = -1.63, mean = -0.0198, max = 3.15
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -3.0, mean = 0.0198, max = 3.79
  -> model_info        : package sklearn

A new explainer has been created!
In [9]:
order = X.columns.to_list()
In [10]:
np.random.seed(1)
explainer1.predict_parts(observation_1.iloc[[0]], type="break_down", order=order).plot()
explainer1.predict_parts(observation_1.iloc[[0]], type="shap").plot()

The break-down plot shows that longitude has the largest positive contribution to the prediction, while latitude has the largest negative one. The Shapley plot gives the opposite result, so we may expect these two variables to interact. Moreover, the influence of ocean_proximity_INLAND changes from slightly positive to clearly negative, so the ocean_proximity categorical variable and the coordinates may jointly encode location rather than acting as independent factors.
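The order dependence behind this disagreement can be illustrated with a toy model (a hand-rolled sketch of the idea, not dalex's implementation): for f(x1, x2) = x1 * x2 the break-down contribution depends entirely on which variable is fixed first, while the Shapley value averages the contribution over all orderings.

```python
from itertools import permutations

def f(x1, x2):
    # toy model with a pure interaction between its two inputs
    return x1 * x2

baseline = (0.0, 0.0)   # reference point (stands in for the data average)
instance = (2.0, 3.0)   # observation being explained

def breakdown(order):
    # fix variables one by one in the given order; a variable's
    # contribution is the change in prediction when it is set to its value
    current = list(baseline)
    contrib = {}
    for i in order:
        before = f(*current)
        current[i] = instance[i]
        contrib[i] = f(*current) - before
    return contrib

print(breakdown((0, 1)))  # {0: 0.0, 1: 6.0} — x2 gets all the credit
print(breakdown((1, 0)))  # {1: 0.0, 0: 6.0} — now x1 gets it

# Shapley values: average each contribution over all orderings
shap = {i: sum(breakdown(p)[i] for p in permutations(range(2))) / 2
        for i in range(2)}
print(shap)  # {0: 3.0, 1: 3.0}
```

With a pure interaction, break-down assigns the whole effect to whichever variable enters last, whereas Shapley splits it evenly, which is exactly the kind of disagreement seen between the two plots above.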

In [11]:
np.random.seed(1)
explainer1.predict_parts(observation_2.iloc[[0]], type="break_down", order=order).plot()
explainer1.predict_parts(observation_2.iloc[[0]], type="shap").plot()

A similar relation between longitude and latitude can be noticed. total_rooms has a much greater impact on the break-down plot. The contribution of population changes from positive to negative, which may suggest that this variable interacts with another factor.

In [12]:
np.random.seed(1)
explainer1.predict_parts(observation_3.iloc[[0]], type="break_down", order=order).plot()
explainer1.predict_parts(observation_3.iloc[[0]], type="shap").plot()

The break-down plot for this observation shows that only longitude has a positive impact on the prediction; however, as noted before, it may be interacting with latitude.

LIME decomposition¶

In [13]:
explainer2 = lime_tabular.LimeTabularExplainer(
    X.values,
    feature_names=order,
    class_names=["median_house_value"],
    categorical_names=["ocean_proximity"],
    mode="regression",
)
In [14]:
np.random.seed(1)
explainer2.explain_instance(observation_1.values[0], mlp.predict).show_in_notebook()
In [15]:
np.random.seed(1)
explainer2.explain_instance(observation_2.values[0], mlp.predict).show_in_notebook()
In [16]:
np.random.seed(1)
explainer2.explain_instance(observation_3.values[0], mlp.predict).show_in_notebook()

The results of the LIME decomposition confirm that pricing depends mainly on location, described by three variables: longitude, latitude and ocean_proximity. However, the third observation shows that ocean proximity can significantly lower the importance of the coordinates. The housing_median_age variable appears more stable, as its importance stays roughly the same regardless of the other attributes of the house.
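For reference, the core of what LIME does above can be sketched in a few lines of numpy (a toy local surrogate, not the lime library's actual algorithm, and black_box is a stand-in for the neural network): perturb the instance, weight the samples by proximity, and fit a weighted linear model whose coefficients serve as the explanation.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    # hypothetical nonlinear model standing in for mlp.predict
    return X[:, 0] ** 2 + 3 * X[:, 1]

x0 = np.array([1.0, 2.0])                            # instance to explain
X_pert = x0 + rng.normal(scale=0.1, size=(500, 2))   # local perturbations
y_pert = black_box(X_pert)

# exponential kernel: samples closer to x0 get larger weights
dist = np.linalg.norm(X_pert - x0, axis=1)
w = np.exp(-(dist ** 2) / 0.05)

# weighted least squares via row scaling by sqrt(w)
A = np.hstack([X_pert, np.ones((500, 1))]) * np.sqrt(w)[:, None]
b = y_pert * np.sqrt(w)
coef, *_ = np.linalg.lstsq(A, b, rcond=None)
print(coef[:2])  # local feature effects, roughly the gradient (2, 3)
```

The fitted coefficients approximate the model's local gradient at x0, which is why LIME's bar lengths can be read as "how much the prediction moves per unit of this feature, near this observation".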